Add tau2-bench training cookbook and implementation #1156
This PR adds a complete implementation and training cookbook for tau2-bench, demonstrating how to train multi-turn tool-use agents using the SFT → RFT → GRPO pipeline.

Key additions:
- Complete tau2-bench training cookbook with a methodology deep-dive
- Tau2 implementation (rollout, eval, reward shaping, actions)
- Training scripts for SFT and GRPO with shaped rewards
- Unified eval.py supporting Pass@1 (greedy) and Pass@K (multi-sampling)
- Performance results: 57.1% Pass@4 (4× baseline improvement)

Resources:
- Training data: tau2-sft-seed-v3 (~3K filtered trajectories)
- Checkpoints: Qwen3-4B-tau2-sft1, Qwen3-4B-tau2-grpo-v1
- WandB logs: full training metrics and sample outputs

The cookbook explains credit assignment in multi-turn tasks, progressive training benefits, and turn-level reward shaping with research citations.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
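For readers skimming the thread, here is a minimal sketch of what turn-level reward shaping with an outcome/dense blend could look like. The function and parameter names (`shape_reward`, `alpha`, `turn_scores`) are illustrative assumptions, not the actual API in the PR's reward.py:

```python
def shape_reward(final_success: bool, turn_scores: list[float], alpha: float = 0.7) -> float:
    """Blend the sparse end-of-episode outcome with dense per-turn partial scores.

    An alpha close to 1.0 trusts the final task-success signal; lower values lean on
    the intermediate turn-level scores, which helps credit assignment in long
    multi-turn tool-use episodes.
    """
    outcome = 1.0 if final_success else 0.0
    # Average partial credit across turns; default to 0.0 for empty episodes.
    dense = sum(turn_scores) / len(turn_scores) if turn_scores else 0.0
    return alpha * outcome + (1.0 - alpha) * dense


# Example: a failed episode still earns partial reward from good intermediate turns.
print(shape_reward(False, [0.5, 1.0, 0.0]))  # 0.15 with alpha=0.7
```

A "domain-adaptive alpha", as mentioned later in the review summary, would simply make `alpha` a per-domain value rather than a constant.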
- Standardize on the Qwen3 native function calling format only
- Remove legacy action format support
- Simplify action parsing and observation formatting
- Clean up reward shaping and prompting utilities
- Reduce code complexity across tau2 modules

This commit reduces code by 1,369 lines while preserving all functionality.
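As a rough illustration of the native function-calling format this refactor standardizes on: Qwen-style models emit tool calls as `<tool_call>` blocks carrying a JSON payload. The exact template is defined by the model's chat template, and the parser below is a hedged sketch rather than the PR's actions.py:

```python
import json
import re

# Qwen-style tool calls look like <tool_call>{"name": ..., "arguments": {...}}</tool_call>.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_tool_calls(completion: str) -> list[dict]:
    """Extract tool calls from a completion, skipping malformed JSON instead of failing."""
    calls = []
    for match in TOOL_CALL_RE.finditer(completion):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            continue  # tolerate partially formed calls during RL rollouts
    return calls


sample = '<tool_call>\n{"name": "get_order", "arguments": {"order_id": "W123"}}\n</tool_call>'
print(parse_tool_calls(sample))  # [{'name': 'get_order', 'arguments': {'order_id': 'W123'}}]
```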
Force-pushed from f56c41a to 10f5405.
Pull request overview
This PR adds a comprehensive implementation and training cookbook for tau2-bench, demonstrating how to train multi-turn tool-use agents using a progressive SFT → RFT → GRPO pipeline. The implementation achieves 57.1% Pass@4 on tau2-bench (4× baseline improvement) with a 4B parameter model.
Key Changes:
- Complete tau2-bench training pipeline with SFT, rejection sampling (RFT), and GRPO stages
- Unified evaluation harness supporting both Pass@1 (greedy) and Pass@K (multi-sampling) metrics (a reference Pass@K estimator is sketched after this list)
- Dense reward shaping using turn-level partial scores with domain-adaptive weighting
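For the Pass@K metric mentioned above, the standard unbiased estimator (Chen et al., 2021) is sketched below. Whether eval.py uses this exact estimator or a direct any-of-K check over K samples is not stated in the thread, so treat this as a reference implementation; with n = k samples per task the two coincide.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples, drawn without
    replacement from n generated samples of which c are correct, solves the task."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 4 samples per task, 1 of them correct.
print(pass_at_k(4, 1, 1))  # 0.25
print(pass_at_k(4, 1, 4))  # 1.0
```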
Reviewed changes
Copilot reviewed 16 out of 23 changed files in this pull request and generated 13 comments.
Summary per file:
| File | Description |
|---|---|
| examples/tau-bench/training_cookbook.md | Comprehensive training guide with methodology, performance results, and reproduction instructions |
| examples/tau-bench/tau2/tasks.py | Task preprocessing script to generate JSONL index files for training |
| examples/tau-bench/tau2/rollout.py | Custom rollout function for GRPO with tau2-bench environment integration |
| examples/tau-bench/tau2/reward.py | Reward shaping implementation with domain-adaptive alpha and curriculum learning |
| examples/tau-bench/tau2/prompting.py | Compressed system prompts for reduced KV cache pressure during RL training |
| examples/tau-bench/tau2/actions.py | Action parsing supporting both native FC and legacy formats with robust error handling |
| examples/tau-bench/tau2/env.py | Environment wrapper utilities for tau2-bench with partial score computation |
| examples/tau-bench/tau2/eval.py | Unified evaluation script supporting Pass@K sampling with WandB/Weave integration |
| examples/tau-bench/tau2/run_sft.sh | SFT training script using filtered trajectories from rejection sampling |
| examples/tau-bench/tau2/run_grpo.sh | GRPO training script with shaped rewards and curriculum learning |
| examples/tau-bench/tau2/start_user_sim_server.sh | User simulator server startup script for multi-turn RL rollouts |
| examples/tau-bench/tau2/.env.template | Environment variable template for API keys and configuration |
| examples/tau-bench/tau2/README.md | Component overview and usage instructions |
| examples/tau-bench/README.md | Updated main README with tau1/tau2 benchmark comparison |
| examples/tau-bench/.gitignore | Ignore patterns for outputs and local files |
| examples/tau-bench/tau1/* | Legacy tau1 implementation files (context) |
Comments suppressed due to low confidence (4)
- examples/tau-bench/tau2/eval.py:438 - The default user model is set to "gpt-4.1-mini" which does not exist. OpenAI's model naming convention is "gpt-4o-mini" or "gpt-4-turbo". The version "4.1" is not a valid OpenAI model identifier.
- examples/tau-bench/tau2/env.py:192 - The variable name "denom" is abbreviated and unclear. Consider using "total_weight" or "weight_sum" for better code readability.
- examples/tau-bench/tau2/reward.py:324 - The WandB logging code accesses `_curriculum_tracker._lock` directly (line 302), which breaks encapsulation and could lead to maintenance issues. The underscore prefix indicates this is a private attribute that shouldn't be accessed outside the class. Consider adding a public method to the _TaskCurriculumTracker class that provides the needed statistics in a thread-safe manner (a sketch of such an accessor follows this list).
- examples/tau-bench/tau2/reward.py:327 - The 'except' clause does nothing but pass and there is no explanatory comment.
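A hedged sketch of the public, thread-safe accessor the reward.py:324 comment asks for. The class name `_TaskCurriculumTracker` and the `_lock` attribute come from the review comment; the tracked statistics (`_success_counts`) and method names are hypothetical stand-ins for whatever reward.py actually stores:

```python
import threading


class _TaskCurriculumTracker:
    """Minimal stand-in for the curriculum tracker referenced in the review comment."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._success_counts: dict[str, int] = {}  # hypothetical per-task statistics

    def record(self, task_id: str, success: bool) -> None:
        with self._lock:
            self._success_counts[task_id] = self._success_counts.get(task_id, 0) + int(success)

    def snapshot(self) -> dict[str, int]:
        """Public, thread-safe copy of the stats, so callers such as the WandB logging
        code never have to reach into the private lock directly."""
        with self._lock:
            return dict(self._success_counts)
```

With an accessor like `snapshot()`, the logging path also becomes a natural place to replace the bare `except: pass` noted at reward.py:327 with a comment or a debug log explaining why failures there are safe to ignore.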
Updated the README to improve clarity and formatting.
This is awesome — basically a Tau2 mega-pack 🚀

@Fengzdadi Ah, thank you! Very much appreciate the patch!
Great work on this! I had a couple of quick questions, if you don’t mind:
- Fix tau2 SFT dataset defaults and document exact file selection
- Make GRPO WandB logging optional to match docs
- Clarify GPU allocation, eval temperature, and TAU2_DATA_DIR in cookbook
- Add tau1 episode logging and resilient tool parsing for offline runs
- Bind Ray dashboard to localhost by default
- Clarify public HF checkpoints in cookbook
- Note tau1 stub provider for offline debugging
@maocheng23 My apologies! I thought they were public:
Hi @jbarnes850 , I followed the instructions in your cookbook and trained the model from scratch, here's the result I got:
I'm using the
And you may find the training logs here. I have a few questions, if you don't mind:
Thank you so much!
Hi @zijiexia, thanks for running this and for the detailed table + logs! Awesome work and apologies in advance for the confusion here. Quick clarifications to your questions below:
Thanks for the clarification!
I've rerun the evaluation; here are the results I got.

Settings:
- `--num-samples 1`
- `--temperature 1.0`
- `--top-p 1.0`
Thanks so much for the feedback and re-run here! A couple clarifications that should make the comparisons apples-to-apples:
I just re-ran the full eval on an A100 with gpt-4.1-mini and the cookbook pass@4 settings. I get pass@1 = 0.27 and pass@4 = 0.55 on 100 tasks. W&B run: https://wandb.ai/jbarnes850-near-protocol/slime-tau2-eval/runs/2d534fuo Also addressing the other review feedback from the thread:
Happy to talk through this further if any other revisions are needed!
Thank you for the PR, but I'm afraid I need to close it, as this does not serve as a good example for slime; examples need to be simple and showcase specific features. This PR could be a really nice isolated repo, similar to Alibaba-NLP/qqr. We'd love to add a reference to this implementation in the README.